Add new script hocr-cut for cutting a page #108

zuphilip · 2017-03-18T10:11:01Z

This cuts a page (horizontally) into two pages in the middle
such that most of the bounding boxes are separated nicely,
e.g. for cutting double pages or double columns.

For example, this double page

is cut in the middle, producing a left and a right page.

The whole computation is based on the bounding boxes and therefore needs the output of some OCR or layout segmentation process as input. However, it might be worthwhile to OCR the individual pages again afterwards to get better results (e.g. skew might be more consistent within a single page than across a double page).
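The idea of cutting near the middle while separating most bounding boxes can be sketched as follows. This is only an illustration under assumptions, not the code in `hocr-cut`: the function names, the `(x0, y0, x1, y1)` box tuples, and the `search_fraction` window are all hypothetical.

```python
def count_crossings(boxes, x):
    """Count the boxes that a vertical cut line at position x would pass through."""
    return sum(1 for (x0, y0, x1, y1) in boxes if x0 < x < x1)


def find_cut(boxes, page_width, search_fraction=0.2):
    """Pick the x position near the page middle that cuts the fewest boxes.

    Only positions within +/- search_fraction * page_width of the middle are
    considered (an assumed heuristic). Ties are broken in favour of the
    position closest to the middle.
    """
    middle = page_width // 2
    radius = int(page_width * search_fraction)
    candidates = range(middle - radius, middle + radius + 1)
    return min(
        candidates,
        key=lambda x: (count_crossings(boxes, x), abs(x - middle)),
    )


# Two text blocks with a gutter between x=520 and x=540 on a 1000 px wide page.
boxes = [(10, 10, 520, 50), (540, 10, 990, 50)]
cut_x = find_cut(boxes, page_width=1000)
```

Here the exact middle (x=500) would slice through the first box, so the sketch moves the cut to the nearest position that crosses no box.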

kba

IMHO useful to include in master.

stweil · 2018-09-07T14:17:58Z

Done. Thank you, Philipp and Konstantin, for the contribution and the review.

stweil · 2018-09-07T14:20:31Z

Should we tag a new release based on master? 1.3.0?

stweil · 2018-09-07T14:21:50Z

The script could be extended to create two new hOCR files for left and right page, too.

zuphilip · 2018-09-07T18:50:36Z

A new release sounds good, but there is already one drafted. Sorry, I forgot about it. Maybe we can do two new releases, 1.2.1 and 1.3.0?

Improving the script sounds fine. I also expect that after cutting a double page into two single pages, it might be better to run OCR on each of them again.

stweil · 2018-09-07T18:56:07Z

Let's start with 1.2.1, then create 1.3.0.

Running OCR again on the single pages is reasonable, but it can cost a lot of resources if many pages have to be processed, so separate hOCR files derived from the initial double pages can be desirable in certain situations.
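As a rough sketch of what deriving per-page hOCR data from the double page could involve, the bounding boxes can be assigned to one side of the cut and the right-hand boxes shifted into the new page's coordinate system. The function name `split_boxes`, the center-based assignment rule, and the clamping are assumptions made for illustration; the PR itself only outputs images.

```python
def split_boxes(boxes, cut_x):
    """Assign each (x0, y0, x1, y1) box to the left or the right page.

    A box goes to the side that contains its horizontal center. Boxes on the
    right page are shifted left by cut_x so that their coordinates are
    relative to the new page; values are clamped at the cut edge.
    """
    left, right = [], []
    for x0, y0, x1, y1 in boxes:
        if (x0 + x1) / 2 < cut_x:
            left.append((x0, y0, min(x1, cut_x), y1))
        else:
            right.append((max(x0 - cut_x, 0), y0, x1 - cut_x, y1))
    return left, right


boxes = [(10, 10, 480, 50), (540, 10, 990, 50), (400, 60, 520, 100)]
left, right = split_boxes(boxes, cut_x=500)
```

This avoids a second OCR run, at the price of keeping whatever recognition errors the double-page pass produced.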

zuphilip changed the title ~~Add new script hocr-cut for cutting a pages~~ Add new script hocr-cut for cutting a page Mar 18, 2017

zuphilip added the enhancement label Mar 18, 2017

zuphilip mentioned this pull request Jul 17, 2017

WIP: Reformat hocr-* code according to PEP8 #116

Merged

zuphilip and others added 4 commits September 5, 2018 07:44

Add new script hocr-cut for cutting a pages

03be4ef

This cuts a page (horizontally) into two pages in the middle such that the most of the bounding boxes are separated nicely, e.g. cutting double pages or double columns.

[hocr-cut]: Handle case that image is not present

3b9d343

hocr-cut: Fix whitespace issues

48c93de

Signed-off-by: Stefan Weil <[email protected]>

hocr-cut: Set executable mode for file

ac36c0c

Signed-off-by: Stefan Weil <[email protected]>

stweil force-pushed the split-pages branch from 01fdd58 to ac36c0c Compare September 5, 2018 05:44

stweil added 2 commits September 5, 2018 07:50

hocr-cut: Fix PEP8 style

f70b28f

It was fixed using `yapf -i --style pep8 hocr-cut`. Signed-off-by: Stefan Weil <[email protected]>

hocr-cut: Strip "" from image name

3a91a35

Tesseract uses image names enclosed in "" which must be stripped because otherwise opening the image will fail. Signed-off-by: Stefan Weil <[email protected]>
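A minimal sketch of the quote handling described in this commit message, assuming the image name is read from the `title` attribute of the `ocr_page` element. The helper name `image_name_from_title` is hypothetical and this is not the code from the commit.

```python
def image_name_from_title(title):
    """Return the image file name from an hOCR page title string, or None."""
    for prop in title.split(";"):
        prop = prop.strip()
        if prop.startswith("image "):
            # Tesseract writes: image "name.png"
            # Drop the surrounding double quotes so the file can be opened.
            return prop[len("image "):].strip().strip('"')
    return None


name = image_name_from_title('image "scan 01.png"; bbox 0 0 2000 1500; ppageno 0')
```

The same helper also accepts unquoted names, so hOCR files from tools that omit the quotes keep working.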

kba approved these changes Sep 7, 2018

stweil merged commit adb810c into ocropus:master Sep 7, 2018

stweil deleted the split-pages branch September 7, 2018 14:16

stweil self-assigned this Sep 7, 2018

zuphilip mentioned this pull request Sep 7, 2018

Document hocr-cut in README #132

Closed

zuphilip mentioned this pull request Sep 2, 2019

Output hocr files besides images in hocr-cut #156

Open
